
dataset from validated data resources, such as 1000 Genomes, Omni, and HapMap, and
then it uses the model to filter out the putative artifacts from the called variants.
Applying the model assigns each variant a log-odds ratio score (VQSLOD) that measures
how likely the variant is to be real, given the data used in training. The VQSLOD is
added to the INFO field of the variant, and the variants are then filtered against a
threshold. SNPs and InDels are recalibrated separately. Variant recalibration and
filtering are performed in two steps:

(i) Building the recalibration model:

The recalibration model is built with the VariantRecalibrator tool. Its inputs are the
variants to be recalibrated (“-V”) and the known training datasets (“--resource”). The
latter must be downloaded from a reliable source such as the GATK resource bundle. The
tool fits a Gaussian mixture model that relates the probability that a variant is true
rather than an artifact to continuous covariates such as QD (QualByDepth, quality by
depth), MQ (mapping quality), and FS (FisherStrand). From this model, the VQSLOD is
estimated as the log odds of a variant being true versus being false. Each variant in
the input VCF file is assigned a VQSLOD in the INFO field, and the variants are ranked
by VQSLOD. A tranche sensitivity threshold can be provided with “-tranche” as a
percentage, and the option can be repeated to set several thresholds. The output of this
step is a recalibration file in VCF format and other files, including the tranches file,
which will be used by ApplyVQSR, and plot files.
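The idea behind a tranche threshold can be illustrated with a small shell sketch. It is a
toy example under stated assumptions: the ten truth-site VQSLOD values and the four
call-set records are made up, and tranche_threshold and filter_by_vqslod are hypothetical
helper functions, not GATK commands. The sketch finds the lowest VQSLOD that must be
accepted to retain a given percentage of the truth sites and then keeps the records that
score at or above that cutoff. The calculation inside GATK may differ in detail.

```shell
#!/usr/bin/env bash
# Toy illustration of tranche-based filtering on VQSLOD.
# All scores and records below are made up; nothing here is GATK output.
set -euo pipefail

# Read truth-site scores (one per line) from file $1 and print the lowest
# score that must be accepted to retain $2 percent of those sites.
tranche_threshold() {
  LC_ALL=C sort -rn "$1" | awk -v pct="$2" '
    { score[NR] = $1 }                      # scores arrive sorted high to low
    END {
      need = NR * pct / 100
      k = int(need); if (k < need) k++      # round up to a whole site
      if (k < 1) k = 1
      print score[k]
    }'
}

# Print the records of VCF file $1 whose VQSLOD (INFO, column 8) is at least $2.
filter_by_vqslod() {
  awk -F'\t' -v min="$2" '!/^#/ {
    n = split($8, kv, ";")
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^VQSLOD=/ && substr(kv[i], 8) + 0 >= min + 0) print
  }' "$1"
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical VQSLOD values at ten truth sites.
printf '%s\n' 9.1 7.4 6.0 5.2 4.8 3.3 2.9 1.5 0.2 -3.7 > "$workdir/truth_scores.txt"

# Four hypothetical records (CHROM POS ID REF ALT QUAL FILTER INFO).
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr21\t1000\t.\tA\tG\t50\t.\tQD=20.1;VQSLOD=8.5\n'
  printf 'chr21\t2000\t.\tC\tT\t40\t.\tQD=12.3;VQSLOD=1.1\n'
  printf 'chr21\t3000\t.\tG\tA\t30\t.\tQD=3.2;VQSLOD=-0.6\n'
  printf 'chr21\t4000\t.\tT\tC\t20\t.\tQD=1.0;VQSLOD=-5.2\n'
} > "$workdir/calls.vcf"

threshold=$(tranche_threshold "$workdir/truth_scores.txt" 90)
echo "VQSLOD cutoff for the 90 percent tranche: $threshold"
filter_by_vqslod "$workdir/calls.vcf" "$threshold" | cut -f2,8
```

With these toy values, a 90 percent target means accepting nine of the ten truth sites, so
the cutoff is the ninth-highest score (0.2), and only the records at positions 1000 and
2000 are kept. Raising the target to 100 percent lowers the cutoff to the worst truth-site
score, which retains more true variants at the cost of admitting more artifacts.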

# Download the dbSNP VCF file and its index into the refvcf directory
cd refvcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx
cd ..
# Create the VQSR directory, then move into the vcf directory
mkdir VQSR
cd vcf

~/software/gatk-4.2.3.0/gatk --java-options "-Xmx10g" VariantRecalibrator \
-R ../refgenome/Homo_sapiens_assembly38.fasta \
-V allsamplesSNP_chr21.vcf \
--trust-all-polymorphic \
-tranche 100.0 \
-tranche 99.95 \
-tranche 99.90 \
-tranche 99.85 \
-tranche 99.80 \
-tranche 99.00 \
-tranche 98.00 \
-tranche 97.00 \
-tranche 90.00 \
--max-gaussians 6 \
--resource:1000G,known=false,training=true,truth=true,prior=10.0 \